A short variational proof of equivalence between policy gradients and soft Q learning
Authors
Abstract
Two main families of reinforcement learning algorithms, Q-learning and policy gradients, have recently been proven to be equivalent when using a softmax relaxation on one part, and an entropic regularization on the other. We relate this result to the well-known convex duality of Shannon entropy and the softmax function. Such a result is also known as the Donsker-Varadhan formula. This provides a short proof of the equivalence. We then interpret this duality further, and use ideas of convex analysis to prove a new policy inequality relative to soft Q-learning.
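For concreteness, the duality in question can be written out as follows. This is a standard convex-analysis identity stated here for illustration, not quoted from the paper: for a temperature \tau > 0 and action values q \in \mathbb{R}^{|\mathcal{A}|},

\tau \log \sum_{a} \exp(q_a/\tau) \;=\; \max_{\pi \in \Delta} \Big( \sum_{a} \pi_a q_a + \tau H(\pi) \Big), \qquad H(\pi) = -\sum_{a} \pi_a \log \pi_a,

with the maximum attained by the softmax policy \pi_a \propto \exp(q_a/\tau). The Donsker-Varadhan formula is the measure-theoretic form of the same conjugacy: \log \mathbb{E}_{p}[e^{f}] = \sup_{q} \big( \mathbb{E}_{q}[f] - \mathrm{KL}(q \,\|\, p) \big).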
Similar Resources
Equivalence Between Policy Gradients and Soft Q-Learning
Two of the leading approaches for model-free reinforcement learning are policy gradient methods and Q-learning methods. Q-learning methods can be effective and sample-efficient when they work; however, it is not well understood why they work, since, empirically, the Q-values they estimate are very inaccurate. A partial explanation may be that Q-learning methods are secretly implementing policy g...
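As a minimal numerical sketch of this equivalence in a single state (illustrative code, not from either paper; the temperature and Q-values are made up), one can check that the entropy-regularized return is maximized by the softmax policy and equals the soft Q-learning value:

import numpy as np
from scipy.special import logsumexp, softmax

tau = 0.5                       # entropy temperature (illustrative)
q = np.array([1.0, -0.3, 2.2])  # Q-values for three actions (made up)

pi = softmax(q / tau)                        # softmax (Boltzmann) policy
entropy = -np.sum(pi * np.log(pi))           # Shannon entropy H(pi)
regularized_return = pi @ q + tau * entropy  # <pi, q> + tau * H(pi)
soft_value = tau * logsumexp(q / tau)        # soft value: tau * log sum exp(q/tau)

# The two quantities coincide, which is the one-state core of the equivalence.
assert np.isclose(regularized_return, soft_value)

This is the duality from the abstract above in miniature: the soft value is the Legendre-Fenchel conjugate view of the entropy-regularized policy objective.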
P-V-L Deep: A Big Data Analytics Solution for Now-casting in Monetary Policy
The development of new technologies has confronted science and industry with issues of big-data scalability, as well as its integration with forecasting analytics across the data life cycle. In predictive analytics, forecasting the near future and the recent past - in other words, now-casting - is the continuous study of real-time events and constantly updated whe...
Reinforcement Learning by Value Gradients
The concept of the value-gradient is introduced and developed in the context of reinforcement learning, for deterministic episodic control problems that use a function approximator and have a continuous state space. It is shown that by learning the value-gradients, instead of just the values themselves, exploration or stochastic behaviour is no longer needed to find locally optimal trajectories...
Bridging the Gap Between Value and Policy Based Reinforcement Learning
We establish a new connection between value and policy based reinforcement learning (RL) based on a relationship between softmax temporal value consistency and policy optimality under entropy regularization. Specifically, we show that softmax consistent action values satisfy a strong consistency property with optimal entropy regularized policy probabilities along any action sequence...
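One way to write the one-step form of such a consistency property (my notation, reconstructed from the path-consistency literature rather than quoted from this snippet): for discount \gamma, temperature \tau, optimal value function V^* and optimal entropy-regularized policy \pi^*,

V^*(s) - \gamma V^*(s') \;=\; r(s,a) - \tau \log \pi^*(a \mid s) \quad \text{for every transition } (s,a,s'),

so the optimal policy's log-probabilities can be read off from temporal value differences along any action sequence.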
Bayesian Deep Q-Learning via Continuous-Time Flows
Efficient exploration in reinforcement learning (RL) can be achieved by incorporating uncertainty into model predictions. Bayesian deep Q-learning provides a principled way to do this by modeling Q-values as probability distributions. We propose an efficient algorithm for Bayesian deep Q-learning by posterior sampling of actions in the Q-function via continuous-time flows (CTFs), achieving efficient ...
Journal: CoRR
Volume: abs/1712.08650
Pages: -
Publication date: 2017